Unsupervised Instance Selection from Text Streams

Authors

  • Rafael Bonin
  • Ricardo M. Marcacini
  • Solange Oliveira Rezende
Abstract

Instance selection techniques have received great attention in the literature because they identify a subset of instances (textual documents) that adequately represents the knowledge embedded in an entire text database. Most instance selection techniques are supervised, i.e., they require a labeled data set to define, with the help of classifiers, the separation boundaries of the data. However, manual labeling of instances requires intense human effort, which is impractical when dealing with text streams. In this article, we present an approach for unsupervised instance selection from text streams. In our approach, text clustering methods are used to define the separation boundaries, thereby separating regions of high data density. The most representative instances of each cluster, which are the centers of high-density regions, are selected to represent a portion of the data. A well-known algorithm for sampling data from streams, Reservoir Sampling, has been adapted to incorporate the unsupervised instance selection. We carried out an experimental evaluation using three benchmark text collections, and the reported results show that the proposed approach significantly increases the quality of a knowledge extraction task by using more representative instances.
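The abstract does not say how Reservoir Sampling was adapted, so the following is only a minimal sketch of the two building blocks it names, not the authors' method. `reservoir_sample` is the classic uniform Reservoir Sampling (Algorithm R). `select_representatives` is a hypothetical helper that, given cluster labels from any text clustering method, picks the document closest to each cluster centroid. The bag-of-words vectors, the cosine similarity measure, and all function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def reservoir_sample(stream, k, rng=None):
    """Classic Reservoir Sampling (Algorithm R): uniform sample of k items."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def bag_of_words(text):
    """Assumed representation: lowercase whitespace-token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_representatives(docs, labels):
    """Hypothetical helper: per cluster, return the document nearest the centroid."""
    clusters = defaultdict(list)
    for doc, label in zip(docs, labels):
        clusters[label].append(doc)
    representatives = {}
    for label, members in clusters.items():
        # Summed counts serve as the centroid; cosine ignores the scale.
        centroid = Counter()
        for doc in members:
            centroid.update(bag_of_words(doc))
        representatives[label] = max(
            members, key=lambda d: cosine(bag_of_words(d), centroid)
        )
    return representatives
```

One plausible way to combine the two, again an assumption, is to cluster each incoming batch of the stream, keep only the representatives, and feed those into the reservoir instead of the raw documents.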

Similar resources

A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy

Text classification, the task of assigning metadata to documents, requires significant time and effort when performed by humans. Moreover, with online-generated content growing explosively, manually annotating large-scale, unstructured data becomes a challenge. Currently, many state-of-the-art text mining methods have been applied to the classification process, many of them based on the key wor...


Scaled Entropy and DF-SE: Different and Improved Unsupervised Feature Selection Techniques for Text Clustering

Unsupervised feature selection techniques for text data have been gaining more and more attention over the last few years. Text data differs from structured data in both origin and content, and it has some special properties that differentiate it from other types of data. In this work we analyze some such features and exploit them to propose a new unsupervised feature selection technique called Sc...


Event Detection in Social Streams

Social networks generate a large amount of text content over time because of continuous interaction between participants. The mining of such social streams is more challenging than traditional text streams, because of the presence of both text content and implicit network structure within the stream. The problem of event detection is also closely related to clustering, because the events can on...


Unsupervised feature selection for sparse data

Feature selection is a well-known problem in machine learning and pattern recognition. Many high-dimensional datasets are sparse, that is, many features have zero value. In some cases, we do not know the class label for some (or even all) patterns in the dataset, leading us to semi-supervised or unsupervised learning problems. For instance, in text classification with the bag-of-words (BoW) re...


Scaling Data Linkage Generation with Domain-Independent Candidate Selection

We propose a candidate selection algorithm for scalably detecting coreferent instance pairs from heterogeneous Semantic Web data sources. Our algorithm selects candidate pairs by computing a character-level similarity on disambiguating literal values that are chosen using domain-independent unsupervised learning. We index the instances on such values to efficiently look up similar instances. Our...



Journal:
  • JIDM

Volume 5

Publication date: 2014